Background subtraction based on Local Shape 



Jean-Philippe Jodoin Guillaume-Alexandre Bilodeau 

Nicolas Saunier 
École Polytechnique de Montréal 
P.O. Box 6079, Station Centre-ville, Montreal, (Quebec), Canada, H3C 3A7 

{jean-philippe.jodoin, guillaume-alexandre.bilodeau, nicolas.saunier}@polymtl.ca 



Abstract 

We present a novel approach to background subtraction 
that is based on the local shape of small image regions. In 
our approach, an image region centered on a pixel is modeled 
using the local self-similarity descriptor. We aim at 
obtaining a reliable change detection based on local shape 
change in an image when foreground objects are moving. 
The method first builds a background model and compares 
the local self-similarities between the background model 
and the subsequent frames to distinguish background and 
foreground objects. Post-processing is then used to refine 
the boundaries of moving objects. Results show that this 
approach is promising as the foregrounds obtained are com- 
plete, although they often include shadows. 



1. Introduction 

Background subtraction methods are an important step 
in numerous computer vision systems. These methods are 
used to identify moving objects in a video stream, which 
is often the first step in complex systems such as activity 
recognition, object tracking, and motion capture. Extract- 
ing the moving objects can improve the reliability of the 
system by reducing the search space, reducing processing 
needs, and allowing the use of simpler techniques for the rest 
of the data extraction. Needless to say, the quality of many 
computer vision systems directly depends on the quality of 
the background subtraction method used. 

Most background subtraction methods work at the pixel 
level, like the classic single Gaussian method [3] and the 
Gaussian mixture model [5]. The shortcoming of these 
methods is that they may be affected by noise and 
perturbations in the image, as no notion of neighborhood 
consistency is used. This problem is difficult to solve at the 
pixel level, which is why we developed a new local 
shape-based approach to background subtraction based on the 
Local Self-Similarity (LSS) descriptor [4]. 



Our approach is similar to other region-based methods 
such as [7], but we use the LSS descriptor to find 
the foreground regions instead of histograms and the 
color intensity of the pixels inside rectangular regions. We 
also use a simple post-processing step to refine the accuracy 
of the objects' boundaries, as the descriptors cover regions 
that are larger than a pixel. Our post-processing does not 
include the removal of shadows, and we have not yet 
considered dynamic backgrounds. 

2. Methodology 

2.1. Background model 

The first step in our method involves the creation of a 
background model. This model is a representation of the 
background with no foreground objects in it. The result- 
ing model will be a grid of local self-similarity descriptors 
and a background image. To build this model, we use a set 
of training frames, in which we calculate the self-similarity 
descriptor of each pixel using default parameters [2]. The 
descriptor, centered on the pixel, is a log-polar 
representation of a correlation surface obtained by comparing, 
with a sum of squared differences, a 5x5 pixel patch inside a 
41x41 pixel patch. The correlation surface is expressed as an 
80-component self-similarity vector (20 angles, 4 radial 
intervals) (5). For the frames of the training set that follow 
the first frame, we calculate the self-similarity descriptors 
for all pixels and the Euclidean distance to the existing 
region descriptor at each corresponding pixel position. If 
the distance is below a threshold (a threshold of 1 was used), 
we increment a counter for that descriptor; otherwise, we 
create a new descriptor for the region at this pixel and set 
its counter to 1. 
This process is repeated as long as there are training frames 
available. If the camera is not static, or if the background 
is dynamic, there will be a lot of descriptors for a single 
image pixel. When all training frames are processed, the 
descriptor for a given pixel position with the highest count 
value (frequency) is selected as the background descriptor 
of the region. We assume that every pixel shows the 
background more often than anything else in the 
training dataset. If a static foreground object is part of more 
than half of the training frames, it will be part of the back- 
ground model. The final background model is composed 
of the LSS descriptor and the color of each pixel. 
The pixel colors are kept for the post-processing step. 
The background model is currently static, but future 
research will aim to take into account dynamic lighting 
conditions and intermittent motion from moving objects. 
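
The voting scheme described above can be illustrated with a short sketch. It is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based storage, and the choice to vote for the first stored descriptor found within the threshold are all hypothetical, and the toy descriptors have 2 components instead of 80 for brevity.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def update_candidates(candidates, descriptor, threshold=1.0):
    """Vote for a stored descriptor, or store a new one with a count of 1.

    `candidates` is a list of [descriptor, count] pairs for one pixel.
    Voting for the first candidate within `threshold` is an assumption.
    """
    for entry in candidates:
        if euclidean(entry[0], descriptor) < threshold:
            entry[1] += 1
            return
    candidates.append([descriptor, 1])

def build_background_model(training_descriptors, threshold=1.0):
    """Return, for every pixel position, its most frequently seen descriptor.

    `training_descriptors` is a list of frames; each frame maps a pixel
    position (row, col) to the descriptor computed at that position.
    """
    per_pixel = {}
    for frame in training_descriptors:
        for position, descriptor in frame.items():
            candidates = per_pixel.setdefault(position, [])
            update_candidates(candidates, descriptor, threshold)
    # Keep the descriptor with the highest count at each position.
    return {pos: max(cands, key=lambda e: e[1])[0]
            for pos, cands in per_pixel.items()}

# Toy example: one pixel observed in three training frames.
frames = [{(0, 0): [0.0, 0.0]}, {(0, 0): [5.0, 5.0]}, {(0, 0): [0.1, 0.2]}]
model = build_background_model(frames)
```

In this toy run, the first and third descriptors fall within the threshold of each other, so the first one collects two votes and is kept as the background descriptor.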

2.2. Foreground object detection 

To detect changes in a frame, we use a process similar 
to the one used for the creation of the background model. 
For a new frame, we calculate the Euclidean distance be- 
tween each pixel descriptor and the corresponding back- 
ground pixel descriptor, and if it is higher than a threshold 
(a threshold of 30 was used), the pixel is assumed to be part 
of the foreground. This gives a good estimate of the fore- 
ground object's position, but it tends to overestimate the size 
of the objects because of the way the LSS descriptor works. 
This is because the local self-similarity descriptor 
correlates a 5x5 image patch with a larger surrounding 
region (41x41 pixel patch). Using too small a region 
tends to make the descriptor less robust, so we kept the 
recommended parameters of the algorithm. For the same 
reason, we pad the frames so that the correlation with the 
larger pixel patch can be computed at the image border 
without losing information. The padding is chosen so that it 
has a neutral effect on the correlation. We 
can calculate the size of the padding border to add to an 
image using 

Padding = b - (b mod 3) (1) 

with 

b = r + p (2) 

where r is the radius around the patch and p is the patch size. 
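
As a rough illustration of this section, the sketch below computes the padding of Equations (1) and (2) and thresholds the descriptor distance to obtain a raw foreground mask. The function names and the nested-list layout are hypothetical, and the example radius of 20 is an assumption (half the width of a 41-pixel region); the text does not state the value of r.

```python
import math

def padding_size(radius, patch_size):
    """Equations (1) and (2): b = r + p, Padding = b - (b mod 3)."""
    b = radius + patch_size
    return b - b % 3

def descriptor_distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def foreground_mask(frame_desc, background_desc, threshold=30.0):
    """Mark a pixel as foreground (1) when its descriptor is farther than
    `threshold` from the background descriptor at the same position.

    Both arguments are 2-D grids (lists of rows) of descriptor vectors.
    """
    mask = []
    for frame_row, bg_row in zip(frame_desc, background_desc):
        mask.append([1 if descriptor_distance(d, b) > threshold else 0
                     for d, b in zip(frame_row, bg_row)])
    return mask

# Assumed radius of 20 with the 5-pixel patch: b = 25, padding = 25 - 1.
pad = padding_size(20, 5)

# Toy 1x2 grid with 2-component descriptors: only the second pixel changed.
raw_mask = foreground_mask([[[0.0, 0.0], [40.0, 0.0]]],
                           [[[0.0, 0.0], [0.0, 0.0]]])
```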

2.3. Post-processing 

The larger surrounding regions around the foreground 
objects reduce the precision of the method. There is also 
a small amount of noise due to dynamic background that 
changes the local self-similarity at a pixel. However, most 
of the foreground objects are complete. A result without 
any post-processing is shown in Figure 1. To obtain more 
precise object boundaries, a series of morphological 
operations is applied. First, a closing is used to fill the holes 
in the foreground objects. Then, an erosion is performed 
on the foreground objects to remove as much 
noise as possible. Finally, a dilation is applied 
to the eroded objects, and subtracting the eroded objects 
from the dilated objects gives us an approximation of the 



Figure 1. LSS background subtraction without post-processing 




Figure 2. Object core and border 



boundaries of the objects. The dilation parameter should be 
adjusted so that the border has a size similar to the 
radius used in the local self-similarity calculation. The 
foreground objects are now separated into a core part and a 
border part (as shown in Figure 2). The core part is used 
directly in the foreground mask, and the border part needs 
further refinement. 

To refine the border of the foreground object, we use a 
simple Euclidean distance between the color intensity of the 
border pixels and the color intensity of the corresponding 
pixel in the background model. If the distance between the 
two pixels is over a threshold (the threshold used was 30), 
we consider the pixel to be part of the foreground; 
otherwise, it is part of the background. 
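
The core/border split and the color-based border refinement can be sketched as follows. This is a simplified illustration under assumptions, not the authors' code: it uses a square structuring element, ignores pixels outside the image, and leaves out the initial closing as well as the final erosion and closing; all function and parameter names are hypothetical.

```python
import math

def _morph(mask, size, combine):
    """Binary morphology with a square window of half-width `size`.

    `combine` is `all` for erosion and `any` for dilation. Positions
    outside the image are simply left out of the window.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[j][i]
                      for j in range(max(0, y - size), min(h, y + size + 1))
                      for i in range(max(0, x - size), min(w, x + size + 1))]
            out[y][x] = 1 if combine(window) else 0
    return out

def erode(mask, size=1):
    return _morph(mask, size, all)

def dilate(mask, size=1):
    return _morph(mask, size, any)

def refine(mask, frame, background, erosion_size=1, border_size=1,
           color_threshold=30.0):
    """Keep the eroded core, then test each border pixel on color distance."""
    core = erode(mask, erosion_size)
    grown = dilate(core, border_size)
    h, w = len(mask), len(mask[0])
    result = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if core[y][x]:
                result[y][x] = 1  # core pixels go straight into the mask
            elif grown[y][x]:     # border = dilated objects minus the core
                if math.dist(frame[y][x], background[y][x]) > color_threshold:
                    result[y][x] = 1
    return result

# Toy 5x5 example: the raw mask is a 3x3 block, but only two pixels
# actually differ in color from the background.
mask = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
        for y in range(5)]
background = [[(0, 0, 0)] * 5 for _ in range(5)]
frame = [row[:] for row in background]
frame[2][2] = frame[2][3] = (100, 100, 100)
refined = refine(mask, frame, background)
```

In this toy case the overestimated 3x3 block shrinks to the two pixels whose color really differs from the background, which is the effect the border refinement aims for.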

The resulting mask is then eroded to remove the noise 
in the border and a closing is applied to have a cleaner and 
more precise foreground mask. The result of this step is 
shown in Figure 3. 

Figure 3. LSS background subtraction after post-processing 



3. Results 

To compare our method to state-of-the-art methods, we 
used the change detection datasets available from [1] 
and applied our method to three categories. One 
of them is the baseline dataset, which is a scene with an 
almost static background and a static camera. We also 
applied our method to the shadow dataset and the thermal 
dataset, which respectively contain picture sequences with 
prominent shadows and thermal imagery. The cameraJitter, 
dynamicBackground and intermittentObjectMotion datasets 
are not covered because the algorithm does not handle those 
situations at the moment; they will be part of future work. 
For each dataset, we had four measures: the 
number of true positives (TP), the number of 
false positives (pixels detected as foreground that should have 
been detected as background) (FP), the number of false 
negatives (pixels detected as background that should have been 
detected as foreground) (FN), and the number of true 
negatives (TN). With these, we calculated the following 
metrics as defined in [1]: 



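$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{9}$$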
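
These metrics follow directly from the four counts. The sketch below is a minimal illustration that assumes the usual definitions of recall, specificity, false positive rate, percentage of bad classification (PBC), precision and F-measure; the function name is hypothetical, the counts in the example are made up for illustration and are not results from this paper, and all denominators are assumed to be non-zero.

```python
def change_detection_metrics(tp, fp, fn, tn):
    """Compute common change detection metrics from pixel counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    false_positive_rate = fp / (fp + tn)
    # Percentage of bad classification: share of misclassified pixels.
    pbc = 100.0 * (fn + fp) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "fpr": false_positive_rate,
        "pbc": pbc,
        "precision": precision,
        "f_measure": f_measure,
    }

# Made-up counts, for illustration only.
metrics = change_detection_metrics(tp=80, fp=20, fn=20, tn=880)
```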

To calculate the rank of the methods in the tables, we 
computed the rank of each method for each metric, and the 
methods were sorted by their average rank over the metrics. 
The results of the other methods are from [1]. The values in 
the result tables are averages across all datasets of a 
category. As shown in Table 1, the recall metric shows that 
our method misses few pixels of the moving objects. 
It has a higher rate of false positives than other 
methods, but it still achieves a reasonable percentage of bad 
classification (PBC). 
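
The ranking procedure described above (one rank per metric, then a sort by average rank) can be sketched as follows. The function name, the method names and the scores are hypothetical and made up for illustration; they are not values from the tables. Ties are not treated specially here, which is an assumption, since the text does not say how ties are handled.

```python
def average_ranks(scores, higher_is_better):
    """Rank methods per metric and sort them by their average rank.

    `scores` maps method -> {metric: value}; `higher_is_better` maps
    metric -> bool (True for recall, False for error rates such as PBC).
    Returns a list of (average_rank, method), best method first.
    """
    methods = list(scores)
    totals = {m: 0.0 for m in methods}
    for metric, higher in higher_is_better.items():
        ordered = sorted(methods, key=lambda m: scores[m][metric],
                         reverse=higher)
        for rank, method in enumerate(ordered, start=1):
            totals[method] += rank
    n_metrics = len(higher_is_better)
    return sorted((totals[m] / n_metrics, m) for m in methods)

# Hypothetical methods and scores, for illustration only.
scores = {
    "A": {"recall": 0.9, "pbc": 3.0},
    "B": {"recall": 0.8, "pbc": 1.0},
    "C": {"recall": 0.7, "pbc": 2.0},
}
ranking = average_ranks(scores, {"recall": True, "pbc": False})
```

Here method A has the best recall but the worst PBC, so method B, which is second on recall and first on PBC, obtains the best average rank.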
Table 1. Metrics for our method applied to the baseline dataset 

Figure 4. Results for the baseline, shadow and thermal datasets 



The LSS descriptor is a good way to detect changes in 
images because moving objects result in a change in shape. 
The difficulty is refining the results to the pixel level. In 
this paper, the refinement is done with a simple Euclidean 
distance that is not adaptive, and simple morphological op- 
erations. Still, the method ranks reasonably well. Because 
shadows cause changes in the local correlation surface, they 

Table 2. Metrics for our method applied to the shadow dataset 




Figure 5. Frame 362 of the shadow/bungalows dataset 

Table 3. Metrics for our method applied to the thermal dataset 



are systematically detected. Furthermore, small holes in 
objects (such as the space between the legs; see the second row 
of Figure 4) are included in the foreground because they are 
smaller than the correlation surface size. However, this is 
beneficial within objects because perturbations at the pixel 
level or smaller than the correlation surface do not affect the 
detection. 

In Table 2, the false positive rate increases significantly 
compared to the baseline dataset. This is due to the 
detection of shadows as a new shape by the LSS descriptor. This 
effect is quite visible in Figure 5. Shadows change the 
local correlation surface because details are less visible as 
the intensities get darker. 

For the thermal dataset, our method had no difficulty 
finding all the moving parts and shows a high recall. 
This is because the boundaries of the moving objects (humans) 
are well defined in the thermal images. The thermal 
reflections of the humans are also very well defined, as 
we can see in Figure 4. The algorithm detects them as part 
of the body with an almost perfect symmetry. This explains 
the high level of false positives. A possible way to eliminate 
those reflections would be to combine the thermal camera in 
stereo with a visible camera, as was done in [6] 
using LSS. 

4. Conclusion and future work 

In this paper, we have used the LSS descriptor as a way 
to distinguish foreground objects from the background. We 
have successfully built a static model of the background and 
used a metric to determine if the pixel patches were part of 
the foreground or the background. After that, we used the 
color information and morphological operations to refine the 
model border. The use of LSS patches instead of individual 
pixel intensities provides some robustness to camera noise 
and small intensity changes, which yields more complete 
foreground objects. In future work, we will make the 
algorithm more robust to small changes in camera viewpoint 
and to long-term changes in the scene (such as a parked car 
in the background that starts moving), and we will add 
shadow removal. 

References 

[1] N. Goyette, P.-M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar. 
Dataset for the IEEE workshop on change detection CVPR 
2012. 2012. 

[2] E. Hörster, T. Greif, R. Lienhart, and M. Slaney. Comparing 
local feature descriptors in pLSA-based image models. Springer 
Verlag, 5096:446-455, 2008. 

[3] S. McKenna, S. Jabri, Z. Duric, H. Wechsler, and A. 
Rosenfeld. Tracking groups of people. Computer Vision and Image 
Understanding, 80(1):42-56, 2000. 

[4] E. Shechtman and M. Irani. Matching Local Self-Similarities 
across Images and Videos. IEEE Conference on Computer 
Vision and Pattern Recognition, 25(3):1-8, 2007. 

[5] C. Stauffer and W. Grimson. Adaptive background 
mixture models for real-time tracking. Proceedings IEEE Conf, 
2:246-252, 1999. 

[6] A. Torabi and G.-A. Bilodeau. Local self-similarity as 
a dense stereo correspondence measure for thermal-visible 
video registration. 2011 IEEE Computer Society 
Conference on Computer Vision and Pattern Recognition Workshops 
(CVPR Workshops), pages 61-67, 2011. 

[7] P. D. Z. Varcheie, M. Sills-Lavoie, and G.-A. Bilodeau. A 
multiscale region-based motion detection and background 
subtraction algorithm. Sensors, 10(2):1041-1061, 2010. 






